This project utilizes publicly available data to recommend 10 similar suburbs in Australia to a given suburb. This could be useful for people trying to decide where to work, rent or buy property. For example, it could be used on property search websites; a person searches for property in a given suburb and is provided a list of similar suburbs to consider buying in. There are three primary data points used to determine how similar other suburbs are to the given suburb;

Foursquare Places API - current data on the venues (cafes, restaurants, parks, amenities etc) most often found in the suburb
Australian Bureau of Statistics API – data on the cultural background (i.e. ancestry) of the people in the suburb
RealestateVIEW.com.au – Age and income demographics of the suburb

For example, given the suburb ‘Brunswick West’ (my home suburb in Melbourne) the engine provides 10 suburbs which have similar venues (typically cafes, pubs and Middle Eastern restaurants), similar age and income residents (older middle class), and similar cultural heritage (English, Italian, Greek).

It utilizes K-Means clustering from Scikit-learn to cluster venue and cultural data, the pandasdmx package to interface with the Australian Bureau of Statistics API, and the Beautiful Soup package to scrape demographic information from the web.

In [9]:

#pandasdmx is used when calling the Australian Bureau of Statistics API
!pip install pandasdmx

Collecting pandasdmx
  Downloading https://files.pythonhosted.org/packages/e7/5d/7e29616b0376d0a15eb3013c909c4e9021b6c6e244ef1b48817b4d0c4d46/pandaSDMX-0.9-py2.py3-none-any.whl (45kB)
     |████████████████████████████████| 51kB 13.8MB/s eta 0:00:01
Requirement already satisfied: lxml in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandasdmx) (4.3.1)
Requirement already satisfied: pandas in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandasdmx) (0.24.1)
Collecting jsonpath-rw (from pandasdmx)
  Downloading https://files.pythonhosted.org/packages/71/7c/45001b1f19af8c4478489fbae4fc657b21c4c669d7a5a036a86882581d85/jsonpath-rw-1.4.0.tar.gz
Requirement already satisfied: setuptools in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandasdmx) (40.8.0)
Requirement already satisfied: requests in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandasdmx) (2.21.0)
Requirement already satisfied: python-dateutil>=2.5.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandas->pandasdmx) (2.7.5)
Requirement already satisfied: numpy>=1.12.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandas->pandasdmx) (1.15.4)
Requirement already satisfied: pytz>=2011k in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandas->pandasdmx) (2018.9)
Requirement already satisfied: ply in /opt/conda/envs/Python36/lib/python3.6/site-packages (from jsonpath-rw->pandasdmx) (3.11)
Requirement already satisfied: decorator in /opt/conda/envs/Python36/lib/python3.6/site-packages (from jsonpath-rw->pandasdmx) (4.3.2)
Requirement already satisfied: six in /opt/conda/envs/Python36/lib/python3.6/site-packages (from jsonpath-rw->pandasdmx) (1.12.0)
Requirement already satisfied: idna<2.9,>=2.5 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->pandasdmx) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->pandasdmx) (2019.6.16)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->pandasdmx) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->pandasdmx) (1.24.1)
Building wheels for collected packages: jsonpath-rw
  Building wheel for jsonpath-rw (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/5c/00/9a/82822db383c2d96dcebf839786665a185f92d37e5026f9806f
Successfully built jsonpath-rw
Installing collected packages: jsonpath-rw, pandasdmx
Successfully installed jsonpath-rw-1.4.0 pandasdmx-0.9

In [10]:

import pandas as pd
import numpy as np
import json # library to handle JSON files
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe
import requests
from sklearn.cluster import KMeans
from pandasdmx import Request
from bs4 import BeautifulSoup

Get suburb and postcode data¶

List of Australian postcodes and suburbs are available online on many websites. I've replicated the file from http://www.corra.com.au/australian-postcode-location-data/ on Github

In [11]:

filename = "https://raw.githubusercontent.com/DataistDogma/Coursera_Capstone/master/Australian_Post_Codes_Lat_Lon/Australian_Post_Codes_Lat_Lon.csv"
postcodes = pd.read_csv(filename)
print("Postcodes shape: ", postcodes.shape)
postcodes.head()

Postcodes shape:  (16080, 7)

Out[11]:

	postcode	suburb	state	dc	type	lat	lon
0	200	AUSTRALIAN NATIONAL UNIVERSITY	ACT	AUSTRALIAN NATIONAL UNI LPO	Post Office Boxes	-35.277272	149.117136
1	221	BARTON	ACT	NaN	LVR	-35.201372	149.095065
2	800	DARWIN	NT	DARWIN DELIVERY CENTRE	Delivery Area	-12.801028	130.955789
3	801	DARWIN	NT	DARWIN DELIVERY CENTRE	Post Office Boxes	-12.801028	130.955789
4	804	PARAP	NT	PARAP	Post Office Boxes	-12.432181	130.843310

Each row represents a Distribution Centre (dc) rather than a suburb or postcode. We're looking for a list of unique suburbs, so we'll filter the dataframe for unique suburbs and remove the 'dc' and 'type' column, which we don't need

In [12]:

postcodes.drop_duplicates(subset='suburb', inplace=True)
postcodes.drop(['dc', 'type'],axis=1,inplace=True )
postcodes.dropna(inplace=True) #drop the trailing row which has no data
postcodes.reset_index(inplace=True, drop=True) #reinex after having removed some rows above
postcodes.shape

Out[12]:

(14269, 5)

For simplicity we'll also create a smaller subset of suburbs, being those in Metropolitan Melbourne, the second-largest city in Australia. The postcode range for Melbourne are postcode between 3000-3207 & 8000-8499

In [13]:

melb_postcodes = postcodes[(postcodes['postcode']>=3000)&(postcodes['postcode']<=3207)|(postcodes['postcode']>=8000)&(postcodes['postcode']<=8499)]
melb_postcodes.head()

Out[13]:

	postcode	suburb	state	lat	lon
4901	3000	MELBOURNE	VIC	-37.814563	144.970267
4902	3002	EAST MELBOURNE	VIC	-37.816640	144.987811
4903	3003	WEST MELBOURNE	VIC	-37.806255	144.941123
4904	3005	WORLD TRADE CENTRE	VIC	-37.822262	144.954856
4905	3006	SOUTHBANK	VIC	-37.823258	144.965926

Assign venue clusters¶

Use the Foursquare API to get venue recommendations for each suburb and then use K-means to assign a venue cluster to the suburb based on the recommendations returned

In [14]:

#intialise Foursquare API credentials
CLIENT_ID = 'EKC0OWGJC1SY1AE1UHB4PUPH2JGARZTQK1U5C1USTUNA43JF' # your Foursquare ID
CLIENT_SECRET = 'CHH0EPUEN2PH4WBDV4XHTDT5NWTUZ2SYVIQALUOZXWYNABRO' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [15]:

#Use the function to extract the category from the dataframe (because the column name could be either 'categories' or 'venue.categories')
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [16]:

#define a function to return all the venues for a given neighborhood

#pass the neighborhood (nb), the latitude (la), the radius and the limit 
def get_venues(nb, la, lo, radius, limit):
    
    #print(nb)
    
    #form the request url and request only the items (which are the venues)
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, la, lo, VERSION, limit)
    items = requests.get(url).json()["response"]['groups'][0]['items']
    
    #include exception handling where Foursquare request fails for a particular Neighbourhood
    if items == []:
        print(" -- Foursquare request for ",nb,"returned no results --")
        return None
    else:
        # flatten JSON, filter for only wanted columns then use the get_category_type funtion to replace the category list with just the category.
        venues1 = json_normalize(items)
        filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
        venues1 =venues1.loc[:, filtered_columns]
        venues1['venue.categories'] = venues1.apply(get_category_type, axis=1)

        #define a new dataframe with the Neighbourhood information
        venues2 = pd.DataFrame(columns=['Suburb','Suburb Latitude', 
                      'Suburb Longitude']) 

        #for each of the venues returned, add the venue dataframe to the (empty) neighbourhood dataframe and fill all rows in the the neighbourhood columns with the neighborhood name, the neighbourhood latitude and longitude
        i=0
        for row in venues1:
            venues2[row]=venues1[row]
            venues2['Suburb']=nb
            venues2['Suburb Latitude']=la
            venues2['Suburb Longitude']=lo
            i=i+1

        #rename the columns
        venues2.rename(index=str,columns={"venue.name":"Venue","venue.categories":"Venue Category","venue.location.lat":"Venue Latitude","venue.location.lng":"Venue Longitude"}, inplace=True)
    
    return venues2

In [17]:

#test out the get_venues function
row = 300
n = postcodes.loc[row]['suburb']
n_lat = postcodes.loc[row]['lat']
n_long = postcodes.loc[row]['lon']

df2=get_venues(n, n_lat, n_long, 500, 100)
df2.head()

Out[17]:

	Suburb	Suburb Latitude	Suburb Longitude	Venue	Venue Category	Venue Latitude	Venue Longitude
0	LANE COVE WEST	-33.806246	151.153281	Lane Cove Aquatic Centre	Pool	-33.813591	151.171442
1	LANE COVE WEST	-33.806246	151.153281	Via Napoli Pizzeria	Pizza Place	-33.815597	151.169879
2	LANE COVE WEST	-33.806246	151.153281	Lane Cove National Park	Park	-33.792008	151.152327
3	LANE COVE WEST	-33.806246	151.153281	North Shore Gym	Gym	-33.814679	151.169105
4	LANE COVE WEST	-33.806246	151.153281	Fourno	Café	-33.795317	151.143533

In [18]:

#Iterate thorough all the suburbs of the dataframe df, adding each set of venues to the df_suburb_venues dataframe
df=melb_postcodes
df_suburb_venues=pd.DataFrame()
radius = 500
limit = 100
print("Getting venue recommendations, please wait......")
for i, row in enumerate(df['suburb']):
        df_suburb_venues=df_suburb_venues.append(get_venues(df.iloc[i]['suburb'], df.iloc[i]['lat'], df.iloc[i]['lon'], radius, limit),ignore_index=True)
print("...... Complete")

Getting venue recommendations, please wait......
...... Complete

In [19]:

counts = df_suburb_venues['Venue Category'].value_counts()
counts.head(10)

Out[19]:

Café                    6132
Fast Food Restaurant    1723
Grocery Store           1455
Supermarket             1426
Sandwich Place          1018
Coffee Shop              917
Shopping Mall            800
Park                     794
Pizza Place              774
Pub                      773
Name: Venue Category, dtype: int64

In [20]:

#Prepare the data for clustering using one-hot encoding
df_onehot = pd.get_dummies(df_suburb_venues['Venue Category'])
#insert a new column for the Neighbourhood values (note that immediately reusing the 'Neighborhood' name causes as error as you can't add a Neighourhood column that already exists, so renamed the column post the insert)
df_onehot.insert(0,"New",df_suburb_venues['Suburb'])
df_onehot.rename(columns={'New':'Suburb'}, inplace=True)

#get the mean frequency occurance
df_onehot = df_onehot.groupby("Suburb").mean().reset_index()
df_onehot.head()

Out[20]:

	Suburb	Afghan Restaurant	Airport	...	Wine Bar	Wine Shop
0	ABECKETT STREET	0.000000	0.00000	...	0.010000	0.00
1	ABERFELDIE	0.000000	0.00000	...	0.000000	0.00
2	AIRPORT WEST	0.000000	0.01087	...	0.000000	0.00
3	ALBANVALE	0.010417	0.00000	...	0.010417	0.00
4	ALBERT PARK	0.000000	0.00000	...	0.010000	0.01

5 rows × 338 columns

In [21]:

#put the top10 into a dataframe
df_topvenues=pd.DataFrame(columns=['Suburb','#1','#2','#3','#4','#5','#6','#7','#8','#9','#10'])

for row in range(0,len(df_onehot)):
    suburb = df_onehot.iloc[row][0]
    topten = df_onehot.iloc[row][1:].sort_values(ascending=False).head(10)
    temp=pd.DataFrame([[suburb,topten.index[0],topten.index[1],topten.index[2],topten.index[3],topten.index[4],topten.index[5],topten.index[6],topten.index[7],topten.index[8],topten.index[9]]],columns=['Suburb','#1','#2','#3','#4','#5','#6','#7','#8','#9','#10'])
    df_topvenues=df_topvenues.append(temp)
df_topvenues.reset_index(inplace=True, drop=True)
df_topvenues.head()

Out[21]:

	Suburb	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10
0	ABECKETT STREET	Korean Restaurant	Café	Coffee Shop	Japanese Restaurant	Bar	Bubble Tea Shop	Indonesian Restaurant	Donut Shop	Burger Joint	Fast Food Restaurant
1	ABERFELDIE	Café	Electronics Store	Grocery Store	Gym	Bakery	Coffee Shop	Shopping Mall	Pub	Japanese Restaurant	Pizza Place
2	AIRPORT WEST	Fast Food Restaurant	Supermarket	Café	Sandwich Place	Grocery Store	Portuguese Restaurant	Convenience Store	Middle Eastern Restaurant	Shopping Mall	Electronics Store
3	ALBANVALE	Fast Food Restaurant	Portuguese Restaurant	Café	Supermarket	Grocery Store	Bakery	Shopping Mall	Vietnamese Restaurant	Pizza Place	Chinese Restaurant
4	ALBERT PARK	Café	Pub	Japanese Restaurant	Gastropub	Burger Joint	Coffee Shop	Beach	Bakery	Fish & Chips Shop	Breakfast Spot

Assign a cluster based on the recommendations using the k-means clustering algorithm from Scikit Learn¶

In [22]:

# import k-means
from sklearn.cluster import KMeans

df_clustering = df_onehot.drop('Suburb',axis=1)

# set number of clusters
k = 5

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:100]

Out[22]:

array([0, 3, 4, 4, 3, 4, 3, 1, 3, 3, 2, 3, 4, 3, 3, 3, 3, 1, 1, 1, 2, 4,
       3, 4, 3, 3, 3, 3, 4, 1, 1, 2, 2, 3, 2, 1, 1, 1, 1, 1, 2, 1, 1, 3,
       1, 3, 3, 1, 1, 1, 2, 2, 1, 1, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 4, 0,
       0, 0, 3, 0, 0, 1, 2, 3, 3, 4, 4, 2, 4, 4, 3, 3, 3, 3, 4, 0, 0, 1,
       4, 2, 2, 3, 3, 2, 3, 3, 3, 1, 1, 3], dtype=int32)

In [23]:

df_topvenues.insert(0,'Venues Cluster',kmeans.labels_)

In [24]:

df_merged = postcodes

df_merged = df_merged.join(df_topvenues.set_index('Suburb'), on='suburb')

#drop rows with no clusters
df_melbourne_merged = df_merged.dropna(subset=['Venues Cluster'])
df_melbourne_merged.head()

Out[24]:

	postcode	suburb	state	lat	lon	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10
4901	3000	MELBOURNE	VIC	-37.814563	144.970267	Cocktail Bar	Italian Restaurant	Coffee Shop	Café	Japanese Restaurant	Hotel	Asian Restaurant	Sandwich Place	French Restaurant	BBQ Joint
4902	3002	EAST MELBOURNE	VIC	-37.816640	144.987811	Café	Vietnamese Restaurant	Cricket Ground	Tennis Stadium	Park	Breakfast Spot	Bakery	Vegetarian / Vegan Restaurant	Football Stadium	Thai Restaurant
4903	3003	WEST MELBOURNE	VIC	-37.806255	144.941123	Café	Korean Restaurant	Bar	Hotel	Middle Eastern Restaurant	Sandwich Place	Shopping Mall	Bakery	Burger Joint	Asian Restaurant
4904	3005	WORLD TRADE CENTRE	VIC	-37.822262	144.954856	Café	Japanese Restaurant	Coffee Shop	Hotel	Bar	French Restaurant	Steakhouse	Restaurant	Italian Restaurant	Australian Restaurant
4905	3006	SOUTHBANK	VIC	-37.823258	144.965926	Bar	Theater	Hotel	Italian Restaurant	Performing Arts Venue	Park	Japanese Restaurant	Café	Australian Restaurant	Burger Joint

Assign cultural clusters¶

Use the Australian Beaureu of Statistics API to get a breakdown on the ancestry for the suburb suburb and then use K-means to assign a venue cluster suburbs with similar ancestry

In [25]:

#create the url which returns the ABS cencus data for the year 2011 for the question T09 - Ancestry by birthplace of parents
allvic_url='ABS_CENSUS2011_T09/TOT+1+2+3+4+Z.TOT+TOTP+1101+1102+6101+3204+2303+2101+5201+2305+2306+3205+3304+7106+2201+3103+6902+4106+3206+3104+1201+1202+3307+3308+2102+3213+7115+9215+3106+4907+5107+2103+OTH+Z.2.SA2..A/all?detail=Full&dimensionAtObservation=AllDimensions'

In [26]:

#create request object
abs = Request('ABS')
#create response object
response = abs.data(allvic_url)
#create dataset object
dataset=response.data
##create top level series
data=response.write()
data.head()

Out[26]:

MEASURE  ANCP  STATE  REGIONTYPE  REGION     FREQUENCY  TIME_PERIOD
TOT      TOT   2      SA2         201011001  A          2001            7543.0
                                                        2006           10219.0
                                                        2011           10969.0
                                  205051101  A          2001           15263.0
                                                        2006           16144.0
dtype: float64

The data above is arranged in a single multi index column (with region being the postcode code and ANCP being the ancestry country code), so we need to pull the indexes into columns and then we can access the columns we need more easily:

In [27]:

#using reset_index removes all the levels of a multi-index column by default
df_ancestry = data.reset_index(name='Value')
print(df_ancestry.shape)
df_ancestry.head()

(265608, 8)

Out[27]:

	MEASURE	ANCP	STATE	REGIONTYPE	REGION	FREQUENCY	TIME_PERIOD	Value
0	TOT	TOT	2	SA2	201011001	A	2001	7543.0
1	TOT	TOT	2	SA2	201011001	A	2006	10219.0
2	TOT	TOT	2	SA2	201011001	A	2011	10969.0
3	TOT	TOT	2	SA2	205051101	A	2001	15263.0
4	TOT	TOT	2	SA2	205051101	A	2006	16144.0

In [28]:

#Take only 2011 data
df_ancestry = df_ancestry[(df_ancestry['TIME_PERIOD']=='2011')]

#The data includes six different measures of ancestry (father only, mother only, both).
#in this case the only measure i'm interested in the totals for each ancestry
df_ancestry = df_ancestry[(df_ancestry['MEASURE']=='TOT')&(df_ancestry['ANCP']!='TOT')]

#pivot the Ancestry to columns in preparation for k-means analysis
df_ancestry = df_ancestry.pivot(index='REGION',columns='ANCP', values='Value')

df_ancestry.head()

Out[28]:

ANCP	1101	1102	1201	1202	2101	2102	2103	2201	2303	2305	...	5107	5201	6101	6902	7106	7115	9215	OTH	TOTP	Z
REGION
201011001	3340.0	6.0	13.0	46.0	3387.0	940.0	54.0	1167.0	253.0	17.0	...	0.0	30.0	109.0	3.0	97.0	4.0	33.0	319.0	8195.0	373.0
201011002	4643.0	11.0	20.0	38.0	4979.0	1609.0	86.0	2201.0	293.0	43.0	...	4.0	36.0	201.0	9.0	159.0	5.0	35.0	583.0	12398.0	848.0
201011003	8635.0	32.0	22.0	70.0	8847.0	2535.0	137.0	3537.0	571.0	83.0	...	12.0	53.0	216.0	7.0	102.0	9.0	39.0	757.0	21516.0	1276.0
201011004	9146.0	48.0	36.0	107.0	9169.0	2531.0	147.0	2969.0	683.0	98.0	...	22.0	74.0	308.0	0.0	156.0	0.0	35.0	912.0	22579.0	1713.0
201011005	2554.0	6.0	8.0	39.0	2939.0	861.0	36.0	1103.0	236.0	41.0	...	0.0	19.0	52.0	4.0	40.0	9.0	27.0	240.0	6583.0	279.0

5 rows × 33 columns

In [29]:

# set number of clusters
k = 10

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(df_ancestry)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:100]

Out[29]:

array([5, 2, 0, 0, 1, 1, 8, 2, 1, 1, 5, 1, 8, 8, 8, 1, 1, 2, 5, 2, 5, 4,
       8, 1, 5, 1, 5, 5, 8, 5, 1, 8, 1, 8, 1, 1, 2, 0, 2, 4, 4, 4, 2, 5,
       2, 5, 2, 5, 1, 4, 1, 8, 2, 1, 1, 2, 5, 8, 1, 8, 8, 8, 5, 8, 8, 4,
       5, 8, 2, 8, 8, 1, 2, 0, 8, 2, 1, 1, 4, 8, 2, 1, 8, 5, 1, 1, 5, 8,
       5, 5, 5, 8, 0, 5, 4, 2, 0, 8, 8, 8], dtype=int32)

In [30]:

df_ancestry.insert(0,'Ancestry Cluster',kmeans.labels_)
df_ancestry.head()

Out[30]:

ANCP	Ancestry Cluster	1101	1102	1201	1202	2101	2102	2103	2201	2303	...	5107	5201	6101	6902	7106	7115	9215	OTH	TOTP	Z
REGION
201011001	5	3340.0	6.0	13.0	46.0	3387.0	940.0	54.0	1167.0	253.0	...	0.0	30.0	109.0	3.0	97.0	4.0	33.0	319.0	8195.0	373.0
201011002	2	4643.0	11.0	20.0	38.0	4979.0	1609.0	86.0	2201.0	293.0	...	4.0	36.0	201.0	9.0	159.0	5.0	35.0	583.0	12398.0	848.0
201011003	0	8635.0	32.0	22.0	70.0	8847.0	2535.0	137.0	3537.0	571.0	...	12.0	53.0	216.0	7.0	102.0	9.0	39.0	757.0	21516.0	1276.0
201011004	0	9146.0	48.0	36.0	107.0	9169.0	2531.0	147.0	2969.0	683.0	...	22.0	74.0	308.0	0.0	156.0	0.0	35.0	912.0	22579.0	1713.0
201011005	1	2554.0	6.0	8.0	39.0	2939.0	861.0	36.0	1103.0	236.0	...	0.0	19.0	52.0	4.0	40.0	9.0	27.0	240.0	6583.0	279.0

5 rows × 34 columns

The table above refers to REGION and ANCP (ancestry) by codes. We need to get the xml from the ABS which describes what each code value is and and use it to merge the name of the suburb back into the data.

In [31]:

#an xml parser is required to get the codes from the Australian Bureau of Statistics xml file
import xml.etree.ElementTree as ET

In [32]:

#the xml file of codes from the ABS
user_agent_url = 'http://stat.data.abs.gov.au/restsdmx/sdmx.ashx/GetDataStructure/ABS_CENSUS2011_T09'
xml_data = requests.get(user_agent_url).content

#create the xml tree from the file
tree = ET.fromstring(xml_data)

In [33]:

#Create a dataframe which will include all the codes
df_codes = pd.DataFrame(columns=['code','value'])

#parse the structure xml to get the codes:
for child in tree:
    for lower in child:
        for codelist in lower:
            for description in codelist:
                if(description.attrib.get('{http://www.w3.org/XML/1998/namespace}lang')=='en'):
                    code = codelist.attrib.get('value')
                    value = description.text
                    temp_df = pd.DataFrame([[code,value]],columns=['code','value'])
                    df_codes=pd.concat([df_codes,temp_df])

#set the code as the index of the table
df_codes.set_index('code', inplace=True)

#strip whitespaces from the values column
df_codes['value'] = df_codes['value'].str.strip()
df_codes.head(12)

Out[33]:

	value
code
TOT	Total responses(c)
1	Both parents born overseas
2	Father only born overseas
3	Mother only born overseas
4	Both parents born in Australia
Z	Birthplace not stated(b)
TOT	Total responses(c)
TOTP	Total persons(c)
1101	Australian
1102	Australian Aboriginal
6101	Chinese
3204	Croatian

In [34]:

#merge the codes back into the Ancestrt dataframe
df_ancestry = df_ancestry.join(df_codes)

We now have a dataframe of Suburbs with Ancestry clusters. We'll clean this dataframe up as well as the original dataframe containing the Venue Cluster so that we can merge them in the subsequent step on suburb name.

In [35]:

# select only the required columns, fix the names and make suburb lowercase for searching
df_ancestry = df_ancestry[['value','Ancestry Cluster']]
df_ancestry.rename(columns={"value":"suburb"}, inplace=True)

#strip out whitespace, change to lowercase
df_ancestry['suburb'] = df_ancestry['suburb'].str.lower()
df_melbourne_merged['suburb']=df_melbourne_merged['suburb'].str.lower()
df_ancestry.sort_values(by=['suburb'], inplace=True)
df_melbourne_merged.sort_values(by=['suburb'], inplace=True)

/opt/conda/envs/Python36/lib/python3.6/site-packages/ipykernel/__main__.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/opt/conda/envs/Python36/lib/python3.6/site-packages/ipykernel/__main__.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [36]:

#merge on suburb
df_complete = df_melbourne_merged.merge(df_ancestry, how='left', on='suburb')

#and pick only relevant columns
df_complete = df_complete[['postcode','suburb','state','Venues Cluster','Ancestry Cluster']]

#and see how many suburbs didn't find a matching Ancestry Cluster
df_complete[pd.isnull(df_complete['Ancestry Cluster'])]['suburb'].shape

Out[36]:

(296,)

The above shows that 296 of our suburbs did not have a entry in the Ancestry dataframe when we merge directly on Suburb name. We'll need to do some looser matching which gets this number down to 93. Note that the remaining 93 are not really suburbs - they are standalone postcodes used by Australia Post

In [37]:

#where suburb couldn't be matched directly, try to find the suburb name as a substing in df_ancestry:
for suburb in df_complete[pd.isnull(df_complete['Ancestry Cluster'])]['suburb']:
    if(len(df_ancestry.loc[df_ancestry['suburb'].str.contains(suburb)]['Ancestry Cluster'])>0):
        df_complete.loc[df_complete[df_complete['suburb']==suburb].index,'Ancestry Cluster']=df_ancestry.loc[df_ancestry['suburb'].str.contains(suburb)]['Ancestry Cluster'][0]

# where there is still no macth, try to split the suburb and find the sub string. This will allow, for example, 'Altona North' to get the cluster for 'Altona'
for suburb in df_complete[pd.isnull(df_complete['Ancestry Cluster'])]['suburb']:
    suburb2 = suburb.split()[0].strip()
    if(len(df_ancestry.loc[df_ancestry['suburb'].str.contains(suburb2)]['Ancestry Cluster'])>0):
        df_complete.loc[df_complete[df_complete['suburb']==suburb].index,'Ancestry Cluster']=df_ancestry.loc[df_ancestry['suburb'].str.contains(suburb2)]['Ancestry Cluster'][0]


#and see how many suburbs didn't find a matching Ancestry Cluster
df_complete[pd.isnull(df_complete['Ancestry Cluster'])]['suburb'].shape

Out[37]:

(93,)

In [38]:

#all the suburbs without an Ancestry Cluster are not actually recognisable suburbs, so we can drop these:
df_complete.dropna(inplace=True)

In [39]:

df_complete.head()

Out[39]:

	postcode	suburb	state	Venues Cluster	Ancestry Cluster
1	3040	aberfeldie	VIC	3.0	0.0
2	3042	airport west	VIC	4.0	1.0
4	3206	albert park	VIC	3.0	2.0
5	3020	albion	VIC	4.0	1.0
6	3078	alphington	VIC	3.0	1.0

Assign age and income brackets¶

Most real estate search tools offer suburb profiles which include information prospective buyers/renters would typically be interested in, such as recent sold prices as well as demographic information. This project will utilize limited demographic information scraped from RealestateVIEW.com.au; Median Age and Median Weekly Income.

In [40]:

#define a function which takes a suburb name and scrapes the web for median age and household income statistics
def get_age_income(state,suburb):
    
    #Format the url
    url='https://propertydata.realestateview.com.au/propertydata/suburb-profile/'+state+'/'
    suburb = suburb.split()
    i=0
    for i in suburb:
        url=url+i+"+"
    
    age = 0
    income = 0

    #get the page text for the suburb
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'lxml')
    
    #find all the important elements
    sp = soup.find_all('b')
    
    #get the median age and income for the suburb
    for element in sp:
        if element.text=='Median Age':
            age = element.next_sibling.next_sibling.translate({ord(i): None for i in '\n\t'})
        if element.text=='Weekly Household Income':
            income = element.next_sibling.next_sibling.translate({ord(i): None for i in '\n\t'})
        
    return age, income

In [41]:

#insert columns for median age and weekly income
df_complete.insert(4,'Age',0)
df_complete.insert(4,'Income',0)

In [42]:

#fetch Median Age and Weekly Income from https://propertydata.realestateview.com.au/
print('Getting Median Age and Weekly Income from https://propertydata.realestateview.com.au/, please wait...')
for row in df_complete.itertuples():
    result = get_age_income('victoria',row.suburb)
    df_complete.loc[df_complete['suburb']==row.suburb,'Age']=result[0]
    df_complete.loc[df_complete['suburb']==row.suburb,'Income']=result[1]
print('...complete')

Getting Median Age and Weekly Income from https://propertydata.realestateview.com.au/, please wait...
...complete

In [43]:

df_complete.head()

Out[43]:

	postcode	suburb	state	Venues Cluster	Income	Age	Ancestry Cluster
1	3040	aberfeldie	VIC	3.0	2200.00	40	0.0
2	3042	airport west	VIC	4.0	1375.00	40	1.0
4	3206	albert park	VIC	3.0	2263.00	43	2.0
5	3020	albion	VIC	4.0	1010.00	33	1.0
6	3078	alphington	VIC	3.0	2123.00	38	1.0

We now have a complete list of suburbs with clusters assigned based on the venues in the area and the ancestry of the people, as well as the relative age and income of the inhabitants

In [44]:

#postcode and state are not required for the recommendation calculation
similar_sort = df_complete.drop(['postcode','state'], axis=1)

#Update the index and clean up the column types
similar_sort.set_index('suburb', inplace=True)
similar_sort['Age']=similar_sort.Age.astype(float)
similar_sort['Income']=similar_sort.Income.astype(float)

#normalize the data into a new dataframe
from sklearn.preprocessing import MinMaxScaler
x = similar_sort.values #returns a numpy array
min_max_scaler = MinMaxScaler()
feature_mtx = min_max_scaler.fit_transform(x)
feature_mtx [0:5]

df_normal = pd.DataFrame(feature_mtx, index=similar_sort.index, columns=similar_sort.columns)

In [45]:

# The following function takes in a suburb and returns the top 10 similar suburbs
def subrec(suburb):
    
    try:
        #choose a suburb for comparison
        new_suburb = suburb.lower()
        new_suburb_df = list(df_normal.loc[new_suburb])

        #get the absolute difference between the new suburb and all other suburbs
        temp_df = df_normal-new_suburb_df
        temp_df = temp_df.abs()
        recommendations = temp_df.sum(axis=1).sort_values().head(10)

        return recommendations

    except:
        print("Error: Suburb not found")

Using the function for a given suburb gives the similarity scores for the top 10 similar suburbs

In [46]:

suburb = 'Prahran'
results = subrec(suburb)
results

Out[46]:

suburb
prahran           0.000000
fitzroy north     0.194010
south yarra       0.254777
hawthorn east     0.286826
yallambie         0.296024
clifton hill      0.320766
coburg            0.337025
malvern east      0.339021
princes hill      0.393546
brunswick east    0.397217
dtype: float64

We can see more details for the top 10 including their feature values:

In [47]:

detailed_results = df_complete[df_complete['suburb'].isin(results.index)]
detailed_results

Out[47]:

	postcode	suburb	state	Venues Cluster	Income	Age	Ancestry Cluster
66	3057	brunswick east	VIC	0.0	1726.00	32	1.0
108	3068	clifton hill	VIC	0.0	2272.00	36	5.0
110	3058	coburg	VIC	0.0	1638.00	35	6.0
159	3068	fitzroy north	VIC	0.0	1920.00	36	5.0
185	3123	hawthorn east	VIC	0.0	1976.00	34	2.0
247	3145	malvern east	VIC	1.0	1937.00	36	4.0
311	3181	prahran	VIC	0.0	1850.00	33	4.0
317	3054	princes hill	VIC	0.0	1857.00	36	1.0
350	3141	south yarra	VIC	0.0	1813.00	32	2.0
414	3085	yallambie	VIC	1.0	1871.00	35	4.0

Map the results¶

In [2]:

!pip install folium

Collecting folium
  Downloading https://files.pythonhosted.org/packages/4f/86/1ab30184cb60bc2b95deffe2bd86b8ddbab65a4fac9f7313c278c6e8d049/folium-0.9.1-py2.py3-none-any.whl (91kB)
     |████████████████████████████████| 92kB 14.3MB/s eta 0:00:01
Requirement already satisfied: requests in /opt/conda/envs/Python36/lib/python3.6/site-packages (from folium) (2.21.0)
Requirement already satisfied: jinja2>=2.9 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from folium) (2.10)
Requirement already satisfied: numpy in /opt/conda/envs/Python36/lib/python3.6/site-packages (from folium) (1.15.4)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/63/36/1c93318e9653f4e414a2e0c3b98fc898b4970e939afeedeee6075dd3b703/branca-0.3.1-py3-none-any.whl
Requirement already satisfied: idna<2.9,>=2.5 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->folium) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->folium) (2019.6.16)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->folium) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->folium) (1.24.1)
Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from jinja2>=2.9->folium) (1.1.0)
Requirement already satisfied: six in /opt/conda/envs/Python36/lib/python3.6/site-packages (from branca>=0.3.0->folium) (1.12.0)
Installing collected packages: branca, folium
Successfully installed branca-0.3.1 folium-0.9.1

In [3]:

import folium # map rendering library

In [5]:

latitude=-37.814563
longitude=144.970267
print(latitude,longitude)

-37.814563 144.970267

In [54]:

detailed_results = df_melbourne_merged[df_melbourne_merged['suburb'].isin(results.index)]

In [55]:

detailed_results

Out[55]:

	postcode	suburb	state	lat	lon	Venues Cluster	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10
5016	3057	brunswick east	VIC	-37.764910	144.979567	0.0	Café	Bar	Bakery	Pizza Place	Mexican Restaurant	Beer Garden	Vietnamese Restaurant	Italian Restaurant	Grocery Store	Gastropub
5036	3068	clifton hill	VIC	-37.788118	144.992067	0.0	Café	Bakery	Pub	Vegetarian / Vegan Restaurant	Italian Restaurant	Bar	Coffee Shop	Gastropub	Cocktail Bar	Ice Cream Shop
5018	3058	coburg	VIC	-37.743188	144.966279	0.0	Café	Bar	Bakery	Middle Eastern Restaurant	Grocery Store	Vietnamese Restaurant	Beer Garden	Thai Restaurant	Pizza Place	Mexican Restaurant
5037	3068	fitzroy north	VIC	-37.783400	144.984688	0.0	Café	Bar	Pub	Bakery	Italian Restaurant	Vegetarian / Vegan Restaurant	Coffee Shop	Ice Cream Shop	Gastropub	Wine Shop
5120	3123	hawthorn east	VIC	-37.782254	145.001811	0.0	Café	Bakery	Vegetarian / Vegan Restaurant	Pub	Pizza Place	Indian Restaurant	Gastropub	Italian Restaurant	Bar	Coffee Shop
5184	3145	malvern east	VIC	-37.878370	145.067892	1.0	Café	Korean Restaurant	Park	Supermarket	Electronics Store	Breakfast Spot	Chinese Restaurant	Thai Restaurant	Malay Restaurant	Fast Food Restaurant
5253	3181	prahran	VIC	-37.849577	144.993714	0.0	Café	Italian Restaurant	Japanese Restaurant	Bar	Coffee Shop	Pub	Deli / Bodega	Pizza Place	Clothing Store	French Restaurant
5008	3054	princes hill	VIC	-37.780995	144.962792	0.0	Café	Bar	Zoo Exhibit	Italian Restaurant	Bakery	Pub	Vietnamese Restaurant	Dog Run	Gastropub	Grocery Store
5173	3141	south yarra	VIC	-37.837883	144.991123	0.0	Café	Italian Restaurant	French Restaurant	Japanese Restaurant	Bakery	Hotel	Dessert Shop	Coffee Shop	Pub	Korean Restaurant
5071	3085	yallambie	VIC	-37.727482	145.102309	1.0	Café	Fast Food Restaurant	Grocery Store	Supermarket	Sandwich Place	Convenience Store	Pub	Art Gallery	Shopping Mall	Burger Joint

In [70]:

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

colors_list = ['Red','Blue','Green','Yellow','Purple','Pink','Orange']

#add markers to the map
markers_colors = []
for lat, lon, suburb in zip(detailed_results['lat'], detailed_results['lon'], detailed_results['suburb']):
    label = folium.Popup(str(suburb), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        #color=colors_list[int(cluster)],
        fill=True,
       fill_opacity=0.7).add_to(map_clusters)

       
map_clusters

Out[70]:

Dataist Dogma

Recommender system: Australian suburbs using clustering and the Foursquare API

Get suburb and postcode data¶

Assign venue clusters¶

Assign a cluster based on the recommendations using the k-means clustering algorithm from Scikit Learn¶

Assign cultural clusters¶

Assign age and income brackets¶

Map the results¶

Recommender system: Australian suburbs using clustering and the Foursquare API

Get suburb and postcode data¶

Assign venue clusters¶

Assign a cluster based on the recommendations using the k-means clustering algorithm from Scikit Learn¶

Assign cultural clusters¶

Assign age and income brackets¶

Recommend a Suburb¶

Map the results¶